Abstract



Neural networks (NNs) provide a powerful and flexible tool for selecting a signal from a larger
background. The D0 collaboration has used them extensively in studying $t\bar{t}$ decays. NNs were
essential to the measurement of the $t\bar{t}$ production cross section in the all-jets channel
($t\bar{t} \to b\bar{b}q\bar{q}q\bar{q}$), and were also used in the measurement of the mass of the
top quark in the lepton+jets channel ($t\bar{t} \to b\bar{b}\ell\nu q\bar{q}$). This paper describes
two new applications of neural networks to top quark analysis: the search for single top quark
production, and an effort to increase the sensitivity in the dilepton channel
$t\bar{t} \to b\bar{b}e\mu\nu\nu$ beyond that achieved in the published analysis.
FIG. 1. A feed-forward neural network.



I. INTRODUCTION 

Since the observation of the top quark in 1995,
much experimental effort has been invested in studying
its properties. Such analyses are difficult, owing to the
small number of $t\bar{t}$ events available, the relatively large
backgrounds, and the complex event geometries. There
has therefore been a great deal of interest in analysis
techniques that could improve on the standard methods
of selecting candidate events. One useful class of such
techniques uses pattern classifiers based on feed-forward
"neural networks."

The D0 experiment at the Fermilab Tevatron has 
made considerable use of neural network techniques in 
its analyses of top quark data. Both the cross section 
measurement in the all-jets channel and the mass
measurement in the lepton + jets channel used neural
networks; details of these analyses have already been
published. 

Here, we describe two more recent studies: a neural 
network analysis of single top quark production, and an 
effort to improve the efficiency for selecting $t\bar{t} \to e\mu$
events using neural networks. We shall start with a brief 
description of the kind of neural networks used in these 
analyses. 

II. NEURAL NETWORKS 

Figure 1 shows an example of the type of neural network
used in these studies. It consists of a set of processing
units, each of which has at least one input and one
output. The output $y_i$ of a single unit $i$ is given in terms
of its inputs $x_{ij}$ by

$$ y_i = g\Big(\sum_j x_{ij} + \theta_i\Big), \tag{1} $$

where $\theta_i$ is a threshold specific to the unit, and $g$ is a
nonlinear squashing function, typically of the form



$$ g(x) = \frac{1}{1 + e^{-x}}. \tag{2} $$

[Thus, the unit outputs are bounded in the range (0, 1).] 
The units are arranged in layers, with the inputs of layer 
n + 1 connected to the outputs of layer n by a weight 
matrix: 

$$ x^{n+1}_{ij} = w^{n+1}_{ij}\, y^{n}_{j}. \tag{3} $$

Typically, the last layer consists of only one unit, and 
is called the "output" layer; the other layers are called
"hidden" layers. Often, the network inputs are said to be the
outputs of a dummy "input" layer. No processing, however,
is done in that "layer." Such a network is quite flexible;
in fact, it has been shown that a network with only
one hidden layer can approximate any reasonable
(Borel-measurable) function to any required degree of accuracy,
provided that sufficient units are available in the hidden
layer.
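
To make the layer structure concrete, the forward pass described by Eqs. (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the analyses: the names `squash` and `forward`, the logistic form chosen for $g$, and the example weights of the 2-3-1 network are all assumptions.

```python
import math

def squash(x):
    """Logistic squashing function (assumed form); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Propagate `inputs` through a list of (weights, thresholds) layers.

    weights[i][j] multiplies output j of the previous layer on its way
    into unit i; thresholds[i] is added before squashing.
    """
    y = list(inputs)  # the dummy "input" layer does no processing
    for weights, thresholds in layers:
        y = [squash(sum(w * yj for w, yj in zip(row, y)) + theta)
             for row, theta in zip(weights, thresholds)]
    return y

# Hypothetical 2-3-1 network: two inputs, three hidden units, one output.
hidden = ([[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]], [0.0, -0.1, 0.2])
output = ([[1.0, -1.5, 0.5]], [0.1])
network_output = forward([0.4, -1.2], [hidden, output])[0]
```

Because every unit applies the same squashing function, the single output always lies strictly between 0 and 1, whatever weights are chosen.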

For pattern recognition, one wants to have the network
output 1 if the input is most consistent with signal, and
0 if the input is most consistent with background. Typically,
one has available a collection of $N$ inputs, some
of which are known to be signal and some of which are
known to be background. One defines an error function:

$$ \chi^2 = \frac{1}{N} \sum_{i=1}^{N} (O_i - t_i)^2, \tag{4} $$

where $O_i$ is the output of the network for input $i$, and $t_i$
is the desired output for that input. This quantity can be
considered as a function of the weights $w$ and thresholds
$\theta$; one then minimizes $\chi^2$ with respect to these variables
to achieve an approximation to the desired function.

The minimization technique most often used is called 
"backpropagation," which is a sort of stochastic gradi- 
ent descent. Other minimization algorithms can also be 
used. This process is often referred to as "training" the 
network. 
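
The training procedure can be illustrated with a small pure-Python sketch of stochastic gradient descent with backpropagation for a network with one hidden layer. Everything specific here is an assumption chosen for illustration: the toy Gaussian "signal" and "background" samples, the three hidden units, the learning rate, the number of cycles, and the logistic squashing function. It is not the jetnet implementation.

```python
import math
import random

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(net, x):
    """Return (hidden outputs, network output) for input vector x."""
    h = [squash(sum(w * xi for w, xi in zip(row, x)) + th)
         for row, th in zip(net["w_hid"], net["th_hid"])]
    o = squash(sum(w * hj for w, hj in zip(net["w_out"], h)) + net["th_out"])
    return h, o

def chi2(net, sample):
    """Mean squared difference between network outputs and targets."""
    return sum((predict(net, x)[1] - t) ** 2 for x, t in sample) / len(sample)

def train(net, sample, rate=0.5, cycles=200, rng=random):
    """Stochastic gradient descent: weights are updated after every event."""
    for _ in range(cycles):
        rng.shuffle(sample)
        for x, t in sample:
            h, o = predict(net, x)
            # Derivative of (o - t)^2 w.r.t. the output unit's summed input.
            d_out = 2.0 * (o - t) * o * (1.0 - o)
            # Propagate the error back to each hidden unit (old weights).
            d_hid = [d_out * w * hj * (1.0 - hj)
                     for w, hj in zip(net["w_out"], h)]
            net["w_out"] = [w - rate * d_out * hj
                            for w, hj in zip(net["w_out"], h)]
            net["th_out"] -= rate * d_out
            for i, d in enumerate(d_hid):
                net["w_hid"][i] = [w - rate * d * xi
                                   for w, xi in zip(net["w_hid"][i], x)]
                net["th_hid"][i] -= rate * d

rng = random.Random(1)
# Toy sample: "signal" (target 1) and "background" (target 0) in two variables.
sample = [([rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)], 1.0) for _ in range(50)]
sample += [([rng.gauss(-1.0, 0.5), rng.gauss(-1.0, 0.5)], 0.0) for _ in range(50)]
net = {
    "w_hid": [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(3)],
    "th_hid": [0.0, 0.0, 0.0],
    "w_out": [rng.uniform(-0.1, 0.1) for _ in range(3)],
    "th_out": 0.0,
}
chi2_before = chi2(net, sample)
train(net, sample, rng=rng)
chi2_after = chi2(net, sample)
```

Comparing `chi2_before` with `chi2_after` shows how far the error function has dropped over the training cycles, which is the quantity plotted against training time when input-variable sets are compared.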



III. SINGLE TOP QUARK PRODUCTION 

The first study we will examine is a search for single
top quark production. The processes relevant at the
Tevatron are illustrated in Fig. 2; the total cross sections
for these processes calculated at next-to-leading order
(NLO) are:

$$ \sigma_{\rm NLO}(p\bar{p} \to t\bar{b}X + {\rm c.c.}) = 0.724 \pm 0.043\ {\rm pb}, $$
$$ \sigma_{\rm NLO}(p\bar{p} \to tq\bar{b}X + {\rm c.c.}) = 1.70 \pm 0.27\ {\rm pb}. \tag{5} $$

Such processes are interesting because they directly
probe the $W$-$t$-$b$ vertex. Assuming the Standard
Model, measuring these cross sections gives a measurement
of the $V_{tb}$ element of the Cabibbo-Kobayashi-Maskawa
(CKM) matrix. Such measurements are also sensitive to any
new physics in the weak interactions of the top quark.

FIG. 2. Feynman diagrams for single top quark production.

After the decay of the top quark, the particles produced
in these processes are $Wb\bar{b}$ and $Wb\bar{b}q$, possibly
with additional jets from QCD radiative effects. This
study looks for leptonic decays of the $W$ boson, so the
initial event selection requires a high-$p_T$ lepton, large
missing transverse energy ($\not\!\!E_T$), and at least two jets. No
$b$-tag is required in this study, in order to preserve signal
efficiency (but if information about a tagging muon
is present, it will be used).

The numbers of signal and background events expected
to remain after this selection for D0's Run 1 (109 pb$^{-1}$)
are as follows:

Process              Expected events
tb                          2.1
tqb                         5.1
QCD multijet             2411
$t\bar{t}$                 22.3
$Wb\bar{b}$                11.4
Wjj (c, s)                 51.8
Wjj (g, u, d)            1615.7
WW                         36.9
WZ                          5.3

As can be seen, the background is huge compared to 
the signal, with the dominant background sources being 
QCD multijet production (with a jet misidentified as a 
lepton) and the production of W bosons with associated 
jets. 
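
The scale of the problem follows from simple arithmetic on the table above. The short sketch below sums the listed yields; the dictionary keys are labels chosen here for illustration, while the numbers are those of the table.

```python
expected = {
    "tb": 2.1, "tqb": 5.1,  # single top signal
    "QCD multijet": 2411.0, "ttbar": 22.3, "Wbb": 11.4,
    "Wjj (c, s)": 51.8, "Wjj (g, u, d)": 1615.7, "WW": 36.9, "WZ": 5.3,
}
signal_names = ("tb", "tqb")

signal = sum(expected[k] for k in signal_names)
background = sum(v for k, v in expected.items() if k not in signal_names)
ratio = signal / background
# Share of the background due to QCD multijet and W + light-parton jets.
dominant = (expected["QCD multijet"] + expected["Wjj (g, u, d)"]) / background
```

With these inputs the signal totals 7.2 events against about 4154 background events, a signal-to-background ratio below 0.2%, and the two dominant sources account for roughly 97% of the background.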

A crucial step in a neural network analysis is the selec- 
tion of the variables used as input to the network. Adding 
more variables potentially increases the amount of infor- 
mation available to the network, but it also expands the 
space that must be searched during the minimization, 
making it more difficult to find a good minimum. In 
fact, with some procedures, adding variables of marginal 
utility can degrade the performance of a network. And 
while neural networks can in principle approximate any 
reasonable function, in practice complicated mappings 
may require too many hidden nodes for minimization to 
be practical. 

A useful observation is that the rate for a scattering
process is greatest in the regions of phase space near
singularities in the corresponding matrix element. If
such singularities occur in different places for signal and
background, then the distributions of the variables in which
the singularities occur should differ strongly between
signal and background. For example, the top quark production
diagrams in Fig. 2 have a singularity at
$M_t^2 = (p_b + p_W)^2 \to m_t^2$. In contrast, the dominant
background diagrams, illustrated in Fig. 3, have
singularities at

$$ (p_{g_1} + p_{g_2})^2 \to 0, \tag{6} $$
$$ t_{q,(g_1 g_2)} = (p_{g_1} + p_{g_2} - p_q)^2 \to 0, \tag{7} $$
$$ t_{q,g_1} = (p_{g_1} - p_q)^2 \to 0, \tag{8} $$
$$ t_{q,g_2} = (p_{g_2} - p_q)^2 \to 0. \tag{9} $$

FIG. 3. Typical Feynman diagrams for the Wjj and QCD
backgrounds to single top quark production.



These variables, however, are defined at the parton level, 
and cannot be directly measured, due to effects of QCD 
radiation, the unobserved neutrino, and unobserved mo- 
mentum that escapes down the beam pipe. In such a 
situation, it is better to use other variables that are re- 
lated to the singular variables, but can be derived directly 
from the observed final state. For example, the typical 
$t$-channel singular variable $t_{if}$ associated with the
production of a light particle (or jet) $f$ can be written

$$ t_{if} = (p_f - p_i)^2 = -\sqrt{\hat{s}}\, e^{Y} p_T^f\, e^{-|y_f|}, \tag{10} $$

where $\sqrt{\hat{s}}$ is the total invariant mass of the produced
system, $Y$ is its total rapidity, and $p_T^f$ and $y_f$ are the
transverse momentum and rapidity of the produced $f$.

From these kinds of considerations, a nominal set of
input variables can be defined as:

Set 1: $M_{j_1 j_2}$, $M_t$, $Y_{\rm tot}$, $p_{Tj_1}$, $y_{j_1}$, $p_{Tj_2}$, $y_{j_2}$, $p_{Tj_{12}}$, $y_{j_{12}}$, $\sqrt{\hat{s}}$,   (11)

where $p_{Tj_{12}}$ and $y_{j_{12}}$ are the transverse momentum and
rapidity of the system formed by the two highest-$p_T$ jets,
and $Y_{\rm tot}$ is the total rapidity of the center of mass of
the initial partons, as reconstructed from the final state.
The $z$-component of the momentum of the $W$ boson is
found by enforcing the $M_W$ mass constraint in the leptonic
$W$ boson decay. Distributions of some of these
variables are shown in Fig. 4.
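
The mass constraint mentioned above amounts to solving a quadratic equation for the unmeasured neutrino $p_z$. The sketch below shows one common way to do this and is not necessarily the procedure used in the analysis. The function name `neutrino_pz`, the default value of 80.4 GeV for the $W$ mass, the massless-lepton approximation, and the treatment of a negative discriminant are all assumptions; the text does not say how the two-fold ambiguity was resolved, so both roots are returned.

```python
import math

def neutrino_pz(lepton, met_x, met_y, m_w=80.4):
    """Solve (p_lepton + p_nu)^2 = m_w^2 for the neutrino p_z.

    `lepton` is (px, py, pz) in GeV for a lepton treated as massless;
    the neutrino transverse momentum is taken to be (met_x, met_y).
    Returns the two roots of the quadratic. When the discriminant is
    negative (no real solution), it is set to zero so that both returned
    values coincide; that choice is an assumption of this sketch.
    """
    px, py, pz = lepton
    pt2 = px * px + py * py
    e_lep = math.sqrt(pt2 + pz * pz)
    mu = 0.5 * m_w * m_w + px * met_x + py * met_y
    disc = mu * mu - pt2 * (met_x * met_x + met_y * met_y)
    root = e_lep * math.sqrt(max(disc, 0.0))
    return ((mu * pz - root) / pt2, (mu * pz + root) / pt2)

# Hypothetical lepton and missing transverse momentum, in GeV.
solutions = neutrino_pz((30.0, 10.0, 20.0), -20.0, 25.0)
```

Given either root, the $W$ four-momentum follows by adding the lepton and neutrino momenta, and quantities such as $M_t$ can then be formed with a chosen jet.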

FIG. 4. Distributions of kinematic variables for single top
signal (dashed) and background (solid), either Wjj (top four)
or QCD (bottom two). Units are in GeV/$c^2$.

Figure 5 compares this set with the simpler sets:

Set 2: $p_{Tj_1}$, $p_{Tj_2}$, $H_{\rm all}$, $H_{T{\rm all}}$;   (12)
Set 3: $p_{Tj_1}$, $p_{Tj_2}$, $H_{\rm all}$, $H_{T{\rm all}}$, $M_t$;   (13)

where $H_{\rm all} = \sum E_f$ and $H_{T{\rm all}} = \sum E_T^f$. The comparison
is made by training a neural network for each of the
sets on a sample of events consisting of top quark signal
plus Wjj background. It is seen that the neural network
built using Set 1 performs better than those using Set 2
or Set 3.
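
Most of the input variables named above are simple functions of the reconstructed four-momenta. The following sketch computes the dijet mass, transverse momentum and rapidity, together with the energy sums $H_{\rm all}$ and $H_{T{\rm all}}$. The helper names, the `(E, px, py, pz)` tuple layout, the definition $E_T = E\sin\theta$, and the choice of which objects enter the sums are assumptions made for illustration.

```python
import math

def pt(p):
    """Transverse momentum of a four-vector (E, px, py, pz)."""
    return math.hypot(p[1], p[2])

def rapidity(p):
    return 0.5 * math.log((p[0] + p[3]) / (p[0] - p[3]))

def add(p, q):
    return tuple(a + b for a, b in zip(p, q))

def mass(p):
    m2 = p[0] ** 2 - p[1] ** 2 - p[2] ** 2 - p[3] ** 2
    return math.sqrt(max(m2, 0.0))

def transverse_energy(p):
    """E_T = E sin(theta), with theta the polar angle of the momentum."""
    return p[0] * pt(p) / math.sqrt(pt(p) ** 2 + p[3] ** 2)

def h_all(objects):
    """Sum of the energies of the given final-state objects."""
    return sum(p[0] for p in objects)

def h_t_all(objects):
    """Sum of the transverse energies of the given final-state objects."""
    return sum(transverse_energy(p) for p in objects)

# Two hypothetical massless jets (E, px, py, pz) in GeV.
jet1 = (50.0, 30.0, 0.0, 40.0)
jet2 = (50.0, -30.0, 0.0, 40.0)
dijet = add(jet1, jet2)
m_j1j2, pt_j12, y_j12 = mass(dijet), pt(dijet), rapidity(dijet)
```

For this back-to-back pair the transverse momenta cancel, so the dijet system has zero $p_T$ but a nonzero mass and rapidity.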

Figure 5 also shows two other variations of the set of
input variables. Set 4 is the same as Set 1, except that
the variables $H_{\rm all}$ and $H_{T{\rm all}}$ are added. It is seen that
this does worse than Set 1: the additional variables do
not add enough information to counteract the increase in
the size of the minimization space. Set 5 adds to Set 1
the widths $W_{\rm jet}$ of the two jets and the $p_T$ of a $b$-tagging
muon (set to zero if there is no such tag). In this case,
the added variables help: Set 5 has a lower $\chi^2$ than any
of the others.

For the final analysis, a separate network is constructed
for each of the major backgrounds, as shown in Fig. 6.
The networks are trained using jetnet; the results
for each network are shown in Fig. 7. Figure 8 shows that
the network output from Monte Carlo models agrees well
with the data. Finally, individual cuts are made on each
of the five network outputs. Figure 9 compares the results
of this to a more conventional analysis. It is seen that
for a given background level, the neural network analysis
provides several times the signal efficiency of conventional
cuts.



FIG. 5. Neural network $\chi^2$ vs. training time ($N_{\rm cycle}$), for
different sets of input variables. The networks are trained on a
sample consisting of the single top signal and Wjj background.
[Panel title: tb and Wjj networks (j = g, u, d).]



IV. $t\bar{t}$ DECAYS INTO $e\mu$

The "golden" channel for observing $t\bar{t}$ decays has long
been the dilepton mode $t\bar{t} \to W^+ b W^- \bar{b} \to e\nu\mu\nu b\bar{b}$.
Due to the presence of two leptons with different flavors,
this channel has a very low background. However,
compared to the channels in which one of the $W$ bosons
decays into jets, the $e\mu$ channel has a relatively small
branching ratio: about 2.5%, versus about 15% for the
$e$ + jets channel. Therefore, any new analysis techniques
that can increase efficiency for identifying signal in this
channel while maintaining the low background level are
welcome.

This study starts from the published measurement of
the $t\bar{t}$ production cross section, which selects $e\mu$
candidates as follows:

• An electron with $E_T > 15$ GeV and $|\eta| < 2.5$.

• A muon with $p_T > 15$ GeV/$c$ and $|\eta| < 1.2$.

• $\not\!\!E_T > 20$ GeV.

• At least two jets with $E_T > 20$ GeV and $|\eta| < 2.5$.

• $\Delta R_{\mu,{\rm jet}} > 0.5$ and $\Delta R_{e,\mu} > 0.25$. ($\Delta R = \sqrt{(\Delta\phi)^2 + (\Delta\eta)^2}$.)

• $H_T > 120$ GeV, where $H_T$ …



For the present study, this selection is relaxed by
removing the cut on $H_T$ and reducing the $\not\!\!E_T$ and jet $E_T$
cuts to 15 GeV. This defines the sample used as input
to the neural network.
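
The standard and relaxed selections can be expressed as one function with adjustable thresholds, as in the sketch below. The event layout (a dictionary with `electron`, `muon`, `met`, `jets` and `ht` entries), the function names, and the requirement that the muon be separated from every selected jet are assumptions made for illustration. $H_T$ is taken as a precomputed number because its definition is not reproduced here.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Separation in eta-phi space, with delta-phi wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def passes(event, met_cut=20.0, jet_et_cut=20.0, ht_cut=120.0):
    """Apply the e-mu selection; pass ht_cut=None to drop the H_T cut."""
    e, mu = event["electron"], event["muon"]
    if not (e["et"] > 15.0 and abs(e["eta"]) < 2.5):
        return False
    if not (mu["pt"] > 15.0 and abs(mu["eta"]) < 1.2):
        return False
    if not event["met"] > met_cut:
        return False
    jets = [j for j in event["jets"]
            if j["et"] > jet_et_cut and abs(j["eta"]) < 2.5]
    if len(jets) < 2:
        return False
    # Assumption: the muon must be separated from every selected jet.
    if any(delta_r(mu["eta"], mu["phi"], j["eta"], j["phi"]) <= 0.5
           for j in jets):
        return False
    if delta_r(e["eta"], e["phi"], mu["eta"], mu["phi"]) <= 0.25:
        return False
    # H_T is a precomputed input; its definition is not specified here.
    if ht_cut is not None and not event["ht"] > ht_cut:
        return False
    return True

def passes_relaxed(event):
    """Relaxed selection: no H_T cut, missing-E_T and jet E_T cuts at 15 GeV."""
    return passes(event, met_cut=15.0, jet_et_cut=15.0, ht_cut=None)

# Hypothetical event that satisfies only the relaxed selection.
event = {
    "electron": {"et": 25.0, "eta": 0.3, "phi": 0.1},
    "muon": {"pt": 20.0, "eta": -0.5, "phi": 2.0},
    "met": 18.0,
    "jets": [{"et": 40.0, "eta": 1.0, "phi": -1.0},
             {"et": 17.0, "eta": -1.5, "phi": 3.0}],
    "ht": 100.0,
}
```

The example event fails the standard selection on missing transverse energy alone, so it is exactly the kind of event that the relaxed selection passes on to the networks.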
FIG. 6. The structure of the neural networks used to reject
each background. Each network has one hidden layer;
the notation $(n_1, n_2, n_3)$ gives the number of units in the
input, hidden, and output layers of each network, respectively.
The following variables are used in addition to those
defined in the text. $n_{j2}$ is 1 if the event has exactly
two jets and 0 otherwise; $n_{j3}$ is 1 if the event has three
or more jets and 0 otherwise. The jet for which the invariant
mass of the lepton, neutrino, and jet is closest to
172 GeV/$c^2$ is denoted $j_{\rm best}$; the notation $j_{\rm all} - j_{\rm best}$ means
all jets except $j_{\rm best}$. Also, $\Delta p_T(W, j_{\rm all}) = p_T(W) - \sum_{\rm jets} p_T$,
$\Delta M(m_W, m_{j_1}, m_{j_2}) = |m_W - m(j_1, j_2)|/m_W$, and
$\Delta M_t = |m(\ell, \nu, j_{\rm best}) - 172|/172$.

FIG. 7. Outputs of each of the neural networks for single
top signal (dashed) and the indicated background (solid).

FIG. 8. A comparison of the combined output of the
five networks for data (the open symbols) and a Monte
Carlo model of signal and all backgrounds (the solid symbols).
The individual network outputs are combined using
$1/O_{\rm tot} = (1/5) \sum_{i=1}^{5} 1/O_i$.

FIG. 9. Comparison of signal/background efficiencies for
NN and conventional analyses. Each point represents one
specific set of cuts.

FIG. 10. Distributions of input variables to the $\tau\tau$ neural
network for signal (dashed) and $Z \to \tau\tau$ background (solid).



There are three major backgrounds to contend with:
QCD jet production with jets misidentified as leptons,
$Z \to \tau\tau \to e\mu$, and $WW \to e\mu$ events. A separate
network is trained to separate the signal from each of
the three backgrounds. Six variables are used as inputs
to each of the networks, including $E_T^{e}$, $E_T^{\rm jet2}$,
$H_T^{\rm jets} = \sum_{\rm jets} E_T$, $M_{e\mu}$, and $\Delta\phi_{e\mu}$; in the $\tau\tau$
network, $E_T^{\rm jet2}$ is replaced by another variable. The input
variables for the $\tau\tau$ network are plotted in Fig. 10. Each network
has seven hidden units. The networks are trained (using
jetnet) on equal numbers of $t\bar{t}$ signal and background
events (2000 of each for the QCD network, and 1000 of
each for the other two). The outputs of the three networks
are combined, as usual, using



$$ \frac{1}{O_{\rm NN}^{\rm comb}} = \frac{1}{3} \sum_{i=1}^{3} \frac{1}{O_i}. \tag{14} $$



Distributions of this variable for signal and background
are shown in Fig. 11. To define the candidate sample, a
final cut of $O_{\rm NN}^{\rm comb} > 0.88$ is imposed, which was determined
by maximizing the expected relative significance,
$S/\sigma_B$. ($\sigma_B$ is the uncertainty in the background
estimate.)
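
A minimal sketch of the combination and the final cut is given below. It assumes the same harmonic-mean form quoted in the caption of Fig. 8, applied here to three networks. The function names and the example outputs are hypothetical, and the optimization of the cut value is not reproduced.

```python
def combine(outputs):
    """Harmonic-mean combination: 1/O_comb = (1/n) * sum(1/O_i)."""
    return len(outputs) / sum(1.0 / o for o in outputs)

def select(events, cut=0.88):
    """Keep events whose combined network output exceeds `cut`."""
    return [ev for ev in events if combine(ev) > cut]

# Hypothetical per-event outputs of the three networks.
events = [
    (0.95, 0.97, 0.92),  # signal-like according to all three networks
    (0.95, 0.97, 0.30),  # one network finds it background-like
    (0.60, 0.70, 0.65),
]
candidates = select(events)
```

A harmonic mean is dominated by its smallest term, so a single background-like output is enough to pull the combined value well below the cut; an event must look signal-like to every network to be kept.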

The resulting signal efficiencies and estimated backgrounds
for D0's Run 1 (108 pb$^{-1}$) are shown in Table I
and Fig. 12. Compared to the standard (published)
analysis, it is seen that the neural network analysis
increases the signal efficiency by about 10%. In addition,
the background is also slightly lower, although this is
harder to evaluate due to the large statistical errors in the
QCD background sample. Further comparison is made
in Fig. 13.




FIG. 11. Distribution of $O_{\rm NN}^{\rm comb}$ for $t\bar{t}$ signal and
background events.





                                      Conventional       NN
                                      analysis           analysis
Signal: $\epsilon \times$ BR (%)
  $m_t = 170$ GeV/$c^2$               0.349 ± 0.074      0.386 ± 0.082
  $m_t = 175$ GeV/$c^2$               0.368 ± 0.078      0.402 ± 0.085
  $m_t = 180$ GeV/$c^2$               0.388 ± 0.082      0.420 ± 0.089
Background: $N$ expected
  $Z \to \tau\tau \to e\mu$           0.10 ± 0.10        0.10 ± 0.07
  $WW \to e\mu$                       0.074 ± 0.020      0.085 ± 0.023
  $\gamma^* \to \tau\tau \to e\mu$    0.006 ± 0.005      0.007 ± 0.006
  Fakes                               0.083 ± 0.126      0.048 ± 0.124
  Total                               0.26 ± 0.16        0.24 ± 0.15

TABLE I. A comparison of the results of the conventional
and neural network $t\bar{t} \to e\mu$ analyses. The numbers of
background events are normalized for D0's Run 1 (108 pb$^{-1}$).
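
The size of the improvement can be read directly from the entries of Table I. The sketch below forms the ratio of NN to conventional efficiency at each top quark mass and sums the background estimates; the dictionary keys are labels chosen for illustration, while the numbers are the central values from the table.

```python
efficiency = {
    # m_t (GeV/c^2): (conventional, NN) efficiency x BR in percent
    170: (0.349, 0.386),
    175: (0.368, 0.402),
    180: (0.388, 0.420),
}
gain = {m: nn / conv for m, (conv, nn) in efficiency.items()}

backgrounds = {
    # process: (conventional, NN) expected events
    "Z -> tautau -> e mu": (0.10, 0.10),
    "WW -> e mu": (0.074, 0.085),
    "gamma* -> tautau -> e mu": (0.006, 0.007),
    "fakes": (0.083, 0.048),
}
total_conv = sum(c for c, _ in backgrounds.values())
total_nn = sum(n for _, n in backgrounds.values())
```

The ratios lie between about 1.08 and 1.11, consistent with the quoted gain of about 10%, and the summed backgrounds reproduce the totals of 0.26 and 0.24 events after rounding.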
FIG. 12. The neural network analysis compared to the
standard analysis, for D0's Run 1. (a) Efficiency times
branching ratio (%); (b) Ratio of NN analysis efficiency to
standard analysis efficiency; (c) Expected number of signal
events. Uncertainties displayed are statistical only; the
systematic uncertainties (included in Table I) are highly
correlated between the two analyses.

FIG. 13. The neural network analysis compared to the
standard analysis. Each point represents a different set of
selection requirements. [Axis label: Acceptance for Background.]

V. CONCLUSIONS



In both the analyses considered here, neural networks 
provide a significant improvement over conventional anal- 
ysis methods. We expect that such techniques will have 
a prominent place in the analysis of data from the up- 
coming Run 2 of the Tevatron. 